Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add posibility to parse multiple line definitions #378

Closed
wants to merge 15 commits into from

Conversation

bosd
Copy link
Collaborator

@bosd bosd commented May 25, 2022

Supersedes #225

Fixes: #155 No warning with lines - Solved in a different manner to show why something matches and on which setting.
Fixes: #224 Support for multiple lines using lines plugin - With addition of line contentating.
Fixes: #186 Support for multiple line groups in a template
Fixes: #238 Is there a support for multiple regex for lines plugin?
Closes: #377 Parse Multiple lines in the same field

Refactored code to allow multiple lines to be parsed.
Restructuring was needed.
As we want to output the lines in the order of the invoice file.
Independent of the order of the template file.

Example yaml input

lines:
 - start: Barcode
   end:  Netto totaal
   line: (?P<line_note>(FOOD))
 - start: Barcode
   end:  Netto totaal
   line: (?P<description>(.+))\s+(?P<price>\d+\d+)

it is also possible to contentate multi line tags.
alternative example sammymaystone.

    lines:
     - start: Item\s+Quantity\s+Rate\s+Amount
       end:  Subtotal
       first_line: 'Service (?P<item>\w)\s+(?P<qty>\S+)\s+.(?P<unitprice>\d+.\d{2})\s+.?(?P<linetotal>\d+.\d{2})'
       line: '(?P<item>.*)'
       skip_line: ['Pino \w'] # 'Description:', Notes:,
       last_line: '(?P<desc>Parts:.*)'
       types:
         qty: int
         unitprice: float
         linetotal: float

input:

Service A                                                                 12                      0.00         20.00
Description: Repair
Notes: Replaced capacitor
Parts: 1 x cap_a
Tax: 0.2%

output:
{'item': 'A\nDescription: Repair\nNotes: Replaced capacitor', 'qty': 12, 'unitprice': 10.0, 'linetotal': 120.0, 'desc': 'Parts: 1 x cap_a'}

bosd

refactored code to allow multiple lines to be parsed.

Restructuring was needed.

As we want to output the lines in the order of the invoice file.

Independent of the order of the template file.

Example yaml input

lines:

 - start: Barcode

   end:  Netto totaal

   line: (?P<line_note>(FOOD))

 - start: Barcode

   end:  Netto totaal

   line: (?P<description>(.+))\s+(?P<price>\d+\d+)

it is also possible to contentate multi line tags.

alternative example sammymaystone.

lines:

 - start: Item\s+Quantity\s+Rate\s+Amount

   end:  Subtotal

   first_line: 'Service (?P<item>\w)\s+(?P<qty>\S+)\s+.(?P<unitprice>\d+.\d{2})\s+.?(?P<linetotal>\d+.\d{2})'

   line: '(?P<item>.*)'

   skip_line: ['Pino \w'] # 'Description:', Notes:,

   last_line: '(?P<desc>Parts:.*)'

   types:

     qty: int

     unitprice: float

     linetotal: float

input:

Service A                                                                 12                      0.00         20.00

Description: Repair

Notes: Replaced capacitor

Parts: 1 x cap_a

Tax: 0.2%

output:

{'item': 'A\nDescription: Repair\nNotes: Replaced capacitor', 'qty': 12, 'unitprice': 10.0, 'linetotal': 120.0, 'desc': 'Parts: 1 x cap_a'}
@m3nu
Copy link
Collaborator

m3nu commented May 29, 2022

Thanks for the contribution! Will test this locally first.

@m3nu
Copy link
Collaborator

m3nu commented May 29, 2022

Test should pass. Linting in this case.

@bosd
Copy link
Collaborator Author

bosd commented May 29, 2022

Flake8 was not willing to install on my local system.

So fixed the errors manually. Could'nt test, but flake8 should be happy now.

@bosd
Copy link
Collaborator Author

bosd commented Jun 15, 2022

@m3nu Should be fixed now, can you trigger the tests?

@rmilecki
Copy link
Collaborator

I'm quite a bit confused by this pull request. I'm not sure what does it do and what are you trying to fix/improve.

  1. A single commit modifying hundreds of lines is hard to follow. It'd be nicer to have few smaller commits that can be reviewed properly.
  2. You referenced issues that may be resolved already (Support for multiple lines using lines plugin #224 actually is)
  3. Your examples and - what's worse - updated documentation uses the old syntax, please use the parser: lines form

Please start with explaining your case.

  • Are you dealing with multiple invoice sections that need to be parsed with lines?
  • Are you dealing with a single invoice section that contains lines of multiple formats?

That example you provided ("Service A") could be parsed without this pull request so it doesn't really help to understand your changes.

@bosd
Copy link
Collaborator Author

bosd commented Jun 19, 2022

I'm quite a bit confused by this pull request. I'm not sure what does it do and what are you trying to fix/improve.

This contribution allow to parse all the invoice lines against different regexes.

I'm sorry, I was unable to do it in a smaller commit.
Basically it is a continuation of 225, but then ported to your contribution of the lines refactoring.
Simply put, each invoice line is tested against a setting which is calling the parser.

Invoice2data is used as a module for the importing of invoices in Odoo.
Previously it was not possible to import multiple lines in the desired format.
This also explains the contentation function. Will try to explain better.
In the input we have

Service A                                                                 12                      0.00         20.00
Description: Repair
Notes: Replaced capacitor
Parts: 1 x cap_a
Tax: 0.2%

Let's call that for now a section.

That section consists of multiple lines.
Lets focus on the description elements. Suppose in the ERP we need the following:
In the tag
item we need a Multiline description, by which the lines are separated by \n
Earlier it was not possible to achieve that. But with this commit we can generate in the lines tag:
'item': 'A\nDescription: Repair\nNotes: Replaced capacitor'
Thus, information parsed from multiple input lines, can be combined in a single tag.

With the alternative syntax, introduced by #308 (what you call new) it was not possible to create the desired output.
Therefore I would vote against the "new" syntax, because of breaking compatibility. (There are users with their own private templates)
And in my opinion it would be more difficult to generate the template lines from template creator.
(A template creator module is something I am working on, currently only exists in notes and in my mind)
If it was my decision, I would revert the "new" syntax. But hey, it is open source. It was quite cumbersome to also support the newly introduced alternative syntax. But decided to put in the extra hours, as maybe some people beside yourself started using it. So now users will have a choice which one to use.

In the "old" syntax all the line info was captured in the lines tag.
However, by the introduction of the "new" syntax it is possible to choose your own tag.
Like in the provided example com.sammymaystone.yaml. Instead us using

   lines:
    parser: lines
    start: 'Item\s+Quantity\s+Rate\s+Amount'
    first_line: 'Service (?P<item>\w)'
    line: '(?P<desc>.*)'
    skip_line: ['Description:', "Notes:"]
    last_line: '(?P<desc>Parts:.*)'
    end: 'Subtotal'

it reads (note the line_items ):

  line_items:
    parser: lines
    start: 'Item\s+Quantity\s+Rate\s+Amount'
    first_line: 'Service (?P<item>\w)'
    line: '(?P<desc>.*)'
    skip_line: ['Description:', "Notes:"]
    last_line: '(?P<desc>Parts:.*)'
    end: 'Subtotal'

The added flexibility may look as an improvement on first hand.
IMO, it is one step further away from a standard solution.
Now we get users using the lines tag and line_items and whatever variation users come up with.
Thus templates are not that easy to be universally transferred.
This is more an issue, for the "standard field names" discussion / PR but it is closely related/
If we want to translate the output from this module to an ERP in the field lines_erp, now to create a mapping entry for
lines to lines_erp
line_items to lines_erp
whatever_someone_makes_up to lines_erp

Sorry, getting a bit late at night typing this. I don't fully recall how the response is with your "new" syntax and contentating.
As in my use case, what you call the "old" syntax is used. In the output I need only one lines tag containing all the line items.
(see first post).
Not:

{'lines': {'all items matching the first parser call'}}

{'lines': {'all items matching the second parser call'}}

To add we want the lines output to be in the same order as they where found in the original input file.
So as the ERP re-assembles the invoice, we can do a full side by side compare and all the line items are matching. 👍

Didn't have the time to do it. But I really want to revert back the "new" syntax.
Or at least the "bad" sammy maystone example. As people may start to use it as a base for creating their own templates and only adding to the confusion.

Hope this is more clear. It's Getting to late now, will continue/improve this post later on 😪 💤

@bosd
Copy link
Collaborator Author

bosd commented Jun 20, 2022

With this PR it is also possible to use multiple line groups in the configuration.
Using the "alternative" syntax that is not possible, limited by Yaml syntax constraint.
Yaml does not allow the same key to be used multiple times.

Quick example of correct syntax for multiple line configurations.

lines:
 - start: Barcode
   end:  Netto totaal  
   line: (?P<line_note>(FOOD))
 - start: Barcode
   end:  Netto totaal
   line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)

@m3nu
Copy link
Collaborator

m3nu commented Jun 23, 2022

@m3nu Should be fixed now, can you trigger the tests?

Check running. Sorry for the delay.

@bosd
Copy link
Collaborator Author

bosd commented Jun 24, 2022

Check running. Sorry for the delay.

NP!

The CI tests are failling. Gonna need some help and pointers why that's the case.
As I could not reproduce it locally.

Local tests are failling for different reasons:
Import Error on local unittests #380
FilenotfoundError #381
pdfminer not installed #379

@m3nu
Copy link
Collaborator

m3nu commented Jun 24, 2022

One is a failed test comparison. Maybe a different extraction result.

The other is a missing variable. Probably a typo.

>               content_of_setting = content[start.end() : end.start()]
E               AttributeError: 'NoneType' object has no attribute 'end'

https://github.com/invoice-x/invoice2data/runs/7024293832?check_suite_focus=true

@bosd
Copy link
Collaborator Author

bosd commented Jun 24, 2022

local tests are passing, with some code changes.
(path related, will create PR)
But I get the impression that the local tests I'm running are slightly different from the CI tests.

So it's a bit of trail and error. Hope you don't mind triggering the CI a couple of times. In case the latest commit doesn't fix it.

@rmilecki
Copy link
Collaborator

rmilecki commented Jun 26, 2022

@bosd: I feel you really don't like the new syntax, I'll have to argue a bit here.

I would vote against the "new" syntax, because of breaking compatibility

While I did my best not to beak backward compatibility, you claim I did. Please create a new issue providing details of YAML that stopped working for you after my changes. It's really important for me.

it would be more difficult to generate the template lines from template creator

I can't treat it as a serious argument against new syntax. It's about some unpublihsed software I can't even test. I can't imagine why it's so hard to generate a different YAML. Maybe your software needs some refactoring.

If it was my decision, I would revert the "new" syntax. But hey, it is open source.

Exactly, it's open source, please suggest a better soltuion for problems I attempted to resolve (resolved?) with my new syntax. Just reverting is not a solution.

However, by the introduction of the "new" syntax it is possible to choose your own tag.
Thus templates are not that easy to be universally transferred.

Invoice fields are not universally named. Line fields are not universally named. Yet you blame me for dropping one generic standarized (forced) field name. Can't you see the real problem is way bigger and we need standarized names for way more fields than just lines?


It feels annoying to have you blaming one feature (syntax) and keeping that semi off-topic discussion here. If at least you could suggest another solution. I wasn't the only person what wanted reusable lines parser, but it seems you didn't even care to read referenced issues. #186 and #224 which you claim to fix were actually fixed by introducing parsers syntax.

@rmilecki
Copy link
Collaborator

You spent quite some time criticizing the new syntax but not so much explaining your changes. Can we take a step back and discuss what do we need to support, please?


You provided a following example of input invoice:

Service A                                                                 12                      0.00         20.00
Description: Repair
Notes: Replaced capacitor
Parts: 1 x cap_a
Tax: 0.2%

and expected parsing output:

 'item': 'A\nDescription: Repair\nNotes: Replaced capacitor'

I took a moment to create a fake invoice with following content:

Item Quantity Rate Amount
Service A                     12   0.00   20.00
Description: Repair
Notes: Replaced capacitor
Parts: 1 x cap_a
Tax: 0.2%
Subtotal 20.00

and wrote a simple YAML with following lines entry:

  lines:
    parser: lines
    start: 'Item\s+Quantity\s+Rate\s+Amount'
    first_line: 'Service (?P<item>\w)'
    line: '(?P<item>.*)'
    skip_line: [ 'Parts:', 'Tax:' ]
    end: 'Subtotal'

parsing results in:

'lines': [{'item': 'A\nDescription: Repair\nNotes: Replaced capacitor'}],

which is exactly what you expected.


So no changes to invoice2data are needed to parse that example invoice section you provided.

Can you provide an actual example of invoice section that:

  1. Can't be parsed with existing invoice2data code
  2. Can be supported with your changes

@bosd
Copy link
Collaborator Author

bosd commented Jun 27, 2022

While I did my best not to beak backward compatibility, you claim I did.

Sorry, Should've been more clear. It is not broken. yet..
Moi point is aimed more at a comment the "old" syntax should be phased out in a year.

Can't you see the real problem is way bigger and we need standarized names for way more fields than just lines?

Yes, I agree the problems are bigger. But can't solve them all at once :) More PR's are coming.

I wasn't the only person what wanted reusable lines parser, but it seems you didn't even care to read referenced issues. #186 and #224 which you claim to fix were actually fixed by introducing parsers syntax.

I've looked in into those issues. Yet they did not resolve it for me.
I'll have to leave it at that for now as I'm just leaving a quick reply while commuting. Will do some more tests and provide an more extensive answer later on.

Can you provide an actual example of invoice section that:

  1. Can't be parsed with existing invoice2data code
  2. Can be supported with your changes

Yes, Will do

Ideally I would like to add more "advanced" examples.
(and an example included in this PR.)
I've made a set of templates but they are not ready to be committed yet. This PR is just a small step in getting them processed.
There is still some more work needed like:

  • Standard field names
  • Ability to parse HTML style table invoices

It's coming, just need more dev time 😃

@rmilecki
Copy link
Collaborator

Moi point is aimed more at a comment the "old" syntax should be phased out in a year.

Good point! Given how little maintainance it takes to keep old syntax support I think we should keep it as long as it doesn't cause serious development / maintenance problems.

Yes, Will do

Ideally I would like to add more "advanced" examples.
(and an example included in this PR.)

Perfect. I think we must have a real example to understand why this change is needed and review it properly. For me it's a really bad idea to merge code that we don't know what it's supposed to solve.

@bosd
Copy link
Collaborator Author

bosd commented Jun 28, 2022

Ok, going back to the original issue in #377
Parse Multiple lines in the same field.

When passing multiple line arguments in the template I won't get the desired outcome.
Made an fictitious Example of an invoice which cannot be parsed. It's inspired by a real wholesaler invoice.
Mekro.pdf

Template (old syntax):

# -*- coding: utf-8 -*-
issuer: Mekro
fields:
  amount: To Pay\s+(\d+.\d{2})
  amount_untaxed: Netto totaal[:]\s+(\d+[,]\d{2})
  date: Invoicedate\s.?\s+(\d{2}-\d{2}-\d{4})\s+\d{2}[:]\d{2}
  invoice_number: Invoicenumber[:]\s+(\S+)
  iban:
    parser: static
    value: NL44INGB0702593702
  partner_coc:
    parser: regex
    regex: '33166113'
  partner_website:
    parser: regex
    regex: mekro.nl
lines:
 - start: Barcode
   end:  Netto totaal  
   line: (?P<line_note>(---FOOD ITEMS---))
 - start: Barcode
   end:  Netto totaal  
   line: (?P<line_note>(---OTHER ITEMS---))
 - start: Barcode
   end:  Netto totaal
   line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
keywords:
  - Mekro
  - NL001799434B01
options:
  date_formats:
    - '%d %m %Y'
  currency: EUR
  languages:
    - en
  decimal_separator: ','

While running it against the code before this pr:

'''
Traceback (most recent call last):
File "/home/bosd/.local/bin/invoice2data", line 10, in
sys.exit(main())
File "/home/bosd/.local/lib/python3.7/site-packages/invoice2data/main.py", line 212, in main
res = extract_data(f.name, templates=templates, input_module=input_module)
File "/home/bosd/.local/lib/python3.7/site-packages/invoice2data/main.py", line 95, in extract_data
return t.extract(optimized_str)
File "/home/bosd/.local/lib/python3.7/site-packages/invoice2data/extract/invoice_template.py", line 218, in extract
plugin_func.extract(self, optimized_str, output)
File "/home/bosd/.local/lib/python3.7/site-packages/invoice2data/extract/plugins/lines.py", line 13, in extract
lines = parsers.lines.parse(self, "lines", self["lines"], content)
File "/home/bosd/.local/lib/python3.7/site-packages/invoice2data/extract/parsers/lines.py", line 20, in parse
settings.update(_settings)
ValueError: dictionary update sequence element #0 has length 3; 2 is required

'''

Ok, that template was using the 'old' syntax.
Let's do it again with the 'new' syntax template.

# -*- coding: utf-8 -*-
issuer: Mekro
fields:
  amount: To Pay\s+(\d+.\d{2})
  amount_untaxed: Netto totaal[:]\s+(\d+[,]\d{2})
  date: Invoicedate\s.?\s+(\d{2}-\d{2}-\d{4})\s+\d{2}[:]\d{2}
  invoice_number: Invoicenumber[:]\s+(\S+)
  iban:
    parser: static
    value: NL44INGB0702593702
  partner_coc:
    parser: regex
    regex: '33166113'
  partner_website:
    parser: regex
    regex: mekro.nl
## new test here
  lines:
    parser: lines
    start: Barcode
    line: (?P<line_note>(---FOOD ITEMS---))
    end: Netto totaal
  lines:
    parser: lines
    start: Barcode
    line: (?P<line_note>(---OTHER ITEMS---))
    end: Netto totaal
  lines:
    parser: lines
    start: Barcode
    line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)
    end: Netto totaal
keywords:
  - Mekro
  - NL001799434B01
options:
  date_formats:
    - '%d %m %Y'
  currency: EUR
  languages:
    - en
  decimal_separator: ','

Result:

[
    {
        "amount": 15.5,
        "amount_untaxed": 5.5,
        "currency": "EUR",
        "date": "2021-01-11",
        "desc": "Invoice from Mekro",
        "iban": "NL44INGB0702593702",
        "invoice_number": "0/0(057)0004/041503",
        "issuer": "Mekro",
        "lines": [
            {
                "barcode": "2231012001992",
                "discount": "0,0",
                "name": "KROKETBROODJES",
                "price_unit": "1,00",
                "qty": "2",
                "uom": "KG"
            },
            {
                "barcode": "8713009019455",
                "discount": "0,0",
                "name": "Oil",
                "price_unit": "0,50",
                "qty": "3",
                "uom": "L"
            },
            {
                "barcode": "8713009019475",
                "discount": "0,0",
                "name": "Apple",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "KG"
            },
            {
                "barcode": "8713009019375",
                "discount": "100,0",
                "name": "programmer",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "Hour"
            },
            {
                "barcode": "0013009019475",
                "discount": "0,0",
                "name": "Sticker",
                "price_unit": "0,0",
                "qty": "1",
                "uom": "pce"
            }
        ],
        "partner_coc": "33166113",
        "partner_website": "mekro.nl"
    }
]

It parses 🎉
Much better, but it only parses using the last line setting in the template file.
Note The following is missing:
---FOOD ITEMS---
---OTHER ITEMS---
Thus it only seems to parse the last line definition block from our template.
(Actually it parses all, but overwrites the output.)

Now, lets use the "new syntax" template with the code in this pr:

[
    {
        "amount": 15.5,
        "amount_untaxed": 5.5,
        "currency": "EUR",
        "date": "2021-01-11",
        "desc": "Invoice from Mekro",
        "iban": "NL44INGB0702593702",
        "invoice_number": "0/0(057)0004/041503",
        "issuer": "Mekro",
        "lines": [
            {
                "barcode": "2231012001992",
                "discount": "0,0",
                "name": "KROKETBROODJES",
                "price_unit": "1,00",
                "qty": "2",
                "uom": "KG"
            },
            {
                "barcode": "8713009019455",
                "discount": "0,0",
                "name": "Oil",
                "price_unit": "0,50",
                "qty": "3",
                "uom": "L"
            },
            {
                "barcode": "8713009019475",
                "discount": "0,0",
                "name": "Apple",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "KG"
            },
            {
                "barcode": "8713009019375",
                "discount": "100,0",
                "name": "programmer",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "Hour"
            },
            {
                "barcode": "0013009019475",
                "discount": "0,0",
                "name": "Sticker",
                "price_unit": "0,0",
                "qty": "1",
                "uom": "pce"
            }
        ],
        "partner_coc": "33166113",
        "partner_website": "mekro.nl"
    }
]

🎉 No Difference!

Ok, new lets use the "old" syntax template with the code from this PR:

[
    {
        "amount": 15.5,
        "amount_untaxed": 5.5,
        "currency": "EUR",
        "date": "2021-01-11",
        "desc": "Invoice from Mekro",
        "iban": "NL44INGB0702593702",
        "invoice_number": "0/0(057)0004/041503",
        "issuer": "Mekro",
        "lines": [
            {
                "line_note": "---FOOD ITEMS---"
            },
            {
                "barcode": "2231012001992",
                "discount": "0,0",
                "name": "KROKETBROODJES",
                "price_unit": "1,00",
                "qty": "2",
                "uom": "KG"
            },
            {
                "barcode": "8713009019455",
                "discount": "0,0",
                "name": "Oil",
                "price_unit": "0,50",
                "qty": "3",
                "uom": "L"
            },
            {
                "barcode": "8713009019475",
                "discount": "0,0",
                "name": "Apple",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "KG"
            },
            {
                "line_note": "---OTHER ITEMS---"
            },
            {
                "barcode": "8713009019375",
                "discount": "100,0",
                "name": "programmer",
                "price_unit": "50,0",
                "qty": "1",
                "uom": "Hour"
            },
            {
                "barcode": "0013009019475",
                "discount": "0,0",
                "name": "Sticker",
                "price_unit": "0,0",
                "qty": "1",
                "uom": "pce"
            }
        ],
        "partner_coc": "33166113",
        "partner_website": "mekro.nl"
    }
]

✨🥇 Now we have a full json representation of the lines from the input file.

@rmilecki
Copy link
Collaborator

I'll try to review this this Sunday (Cc @m3nu)

@bosd
Copy link
Collaborator Author

bosd commented Sep 19, 2022

This one depends upon: #394

@bosd bosd force-pushed the multiline branch 3 times, most recently from fff0526 to 5048cdf Compare September 24, 2022 11:24
bosd

refactored code to allow multiple lines to be parsed.

Restructuring was needed.

As we want to output the lines in the order of the invoice file.

Independent of the order of the template file.

Example yaml input

lines:

 - start: Barcode

   end:  Netto totaal

   line: (?P<line_note>(FOOD))

 - start: Barcode

   end:  Netto totaal

   line: (?P<description>(.+))\s+(?P<price>\d+\d+)

it is also possible to contentate multi line tags.

alternative example sammymaystone.

lines:

 - start: Item\s+Quantity\s+Rate\s+Amount

   end:  Subtotal

   first_line: 'Service (?P<item>\w)\s+(?P<qty>\S+)\s+.(?P<unitprice>\d+.\d{2})\s+.?(?P<linetotal>\d+.\d{2})'

   line: '(?P<item>.*)'

   skip_line: ['Pino \w'] # 'Description:', Notes:,

   last_line: '(?P<desc>Parts:.*)'

   types:

     qty: int

     unitprice: float

     linetotal: float

input:

Service A                                                                 12                      0.00         20.00

Description: Repair

Notes: Replaced capacitor

Parts: 1 x cap_a

Tax: 0.2%

output:

{'item': 'A\nDescription: Repair\nNotes: Replaced capacitor', 'qty': 12, 'unitprice': 10.0, 'linetotal': 120.0, 'desc': 'Parts: 1 x cap_a'}
@bosd
Copy link
Collaborator Author

bosd commented Sep 24, 2022

Local tests are ok. Odd that there is a difference in tests.

@bosd bosd added this to the 0.4.0 release milestone Sep 24, 2022
@bosd bosd requested review from m3nu and rmilecki September 24, 2022 18:11
@bosd
Copy link
Collaborator Author

bosd commented Sep 24, 2022

Another test solved! ✨
Only pdfminer issue left #402

@rmilecki
Copy link
Collaborator

rmilecki commented Sep 25, 2022

This pull request rewrites a lot of src/invoice2data/extract/parsers/lines.py code and is really hard for me to review properly. It also modifies parsers code to while focusing on syntax handled by the plugins code. I think those changes don't really belong here.

@bosd I know you spent a lot of time on this but could you please consider my possibly an alternative soluion: #407 , please? My changes should implement the same feature by just adding 15 new lines to the lines.py parser.
On top of my changes you can easily extend lines plugin syntax if you think we really need that, see #408

@bosd
Copy link
Collaborator Author

bosd commented Sep 25, 2022

This pull request rewrites a lot of src/invoice2data/extract/parsers/lines.py code and is really hard for me to review properly.

True, it's quite a rewrite. It could even be further improved by dumping all the lines in an array. Possibly that would elliminate some of the for loops. (for loops are known for slowing python down)
But at this point I'm not willing to do that. As I already spent lot of time getting this functionality. (for now my focus is on getting data properly parsed instead of speed)

This one is ready! (Apart from the messy commits, propably easier to make a new branch and pr)

It also modifies parsers code to while focusing on syntax handled by the plugins code. I think those changes don't really belong here.
Can you explain this a bit more? I don't understnd what it modifies on the parser code.

Will have a look at those PR's 😄

@rmilecki
Copy link
Collaborator

I've clearly missed some requirements, it seems we have a bit different cases after all.

I'll spend some time analyzing this today and tomorrow and come back to you. Sorry for this confusion.

@rmilecki
Copy link
Collaborator

rmilecki commented Oct 1, 2022

Ok, going back to the original issue in #377 Parse Multiple lines in the same field.

When passing multiple line arguments in the template I won't get the desired outcome. Made an fictitious Example of an invoice which cannot be parsed. It's inspired by a real wholesaler invoice.

Thank you @bosd for this complete explanation of your case. I carefully read that and I believe I finally understand it fully.

By looking at your desired output it seems that you need to:

  1. Parse 1 specific part of invoice containing multiple lines
  2. Lines use two different formats

That syntax you proposed:

lines:
 - start: Barcode
   end:  Netto totaal  
   line: (?P<line_note>(---FOOD ITEMS---))
 - start: Barcode
   end:  Netto totaal  
   line: (?P<line_note>(---OTHER ITEMS---))
 - start: Barcode
   end:  Netto totaal
   line: (?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+)

is quite confusing in my opinion. It's because you have there multiple (three) & repeated start and end entries. That suggests you want to parse 3 different blocks (parts) of invoice. As the same time you want them to be merged and put in the right order.

It's really hard for me to guess what should happen if you use start and end RegEx-es that match different parts of invoice. Parsing them would be fine but how to merge them in user-expected way?


It turns out it's already possible to parse your Mekro.pdf without any modifications to the project. I used:

fields:
  lines:
    parser: lines
    start: Barcode
    line: ((?P<barcode>(\d{13}))\s+(?P<name>(\w+(?:\s\S+)*))\s+(?P<qty>(\d))\s+(?P<uom>\w+)\s+(?P<price_unit>(\d+.\d+))\s+(?P<discount>\d+.\d+))|(?P<line_note>---(.*ITEMS)---)
    end: Netto totaal

(the trick is to use | in the lines RegEx).

That may not be perfect but gives output close to what you expected:

"lines": [
    {
        "barcode": "",
        "name": "",
        "qty": "",
        "uom": "",
        "price_unit": "",
        "discount": "",
        "line_note": "---FOOD ITEMS---"
    },
    {
        "barcode": "2231012001992",
        "name": "KROKETBROODJES",
        "qty": "2",
        "uom": "KG",
        "price_unit": "1,00",
        "discount": "0,0",
        "line_note": ""
    },
    {
        "barcode": "8713009019455",
        "name": "Oil",
        "qty": "3",
        "uom": "L",
        "price_unit": "0,50",
        "discount": "0,0",
        "line_note": ""
    },
    {
        "barcode": "8713009019475",
        "name": "Apple",
        "qty": "1",
        "uom": "KG",
        "price_unit": "50,0",
        "discount": "0,0",
        "line_note": ""
    },
    {
        "barcode": "",
        "name": "",
        "qty": "",
        "uom": "",
        "price_unit": "",
        "discount": "",
        "line_note": "---OTHER ITEMS---"
    },
    {
        "barcode": "8713009019375",
        "name": "programmer",
        "qty": "1",
        "uom": "Hour",
        "price_unit": "50,0",
        "discount": "100,0",
        "line_note": ""
    },
    {
        "barcode": "0013009019475",
        "name": "Sticker",
        "qty": "1",
        "uom": "pce",
        "price_unit": "0,0",
        "discount": "0,0",
        "line_note": ""
    }
]

I admit that using | in RegEx doesn't result in clean (easy to read) RegEx and it also produces unwated (empty) entries in output JSON. For that I think we could improve our lines parser.

I think we should just allow multiple entries in the line. That would result in:

  1. Simple implementation (just few extra lines of code)
  2. Clear syntax (no multiple start and end for parsing one block (part) of invoice)

@bosd
Copy link
Collaborator Author

bosd commented Oct 3, 2022

By looking at your desired output it seems that you need to:

Parse 1 specific part of invoice containing multiple lines
Lines use two different formats

I think you understand it correctly. The usage of the | as suggested here is interesting.
Yet, the result is not compatible with the code used to further process the results from this library.

Ooh, yeah, I see the multiple start and end's can be confusing.
(but I think is easy syntax, for converting a table style template builder as input. Yet, thats not implemented in code, so we can ignore that for now)

Kinda hard to put it into full spec.

  • Parse multiple line formats in order as they appear in the input
  • Still allow to contentate values if multiple regex rules match the field. (Do contentate untill a new firstline is received)
  • Output order should be independent from the order of the template. (it should take order from the input file) (nice to have)

(Contentate is not used in the above example)

I'll have a look at the mentioned pr

@bosd
Copy link
Collaborator Author

bosd commented Oct 21, 2022

Superseded by #417

@bosd bosd closed this Oct 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants